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The High Altitude Water Cherenkov (HAWC) Observatory continuously observes gamma-rays 
between 100 GeV to 100 TeV in an instantaneous field of view of about 2 steradians above the 
array. The large amount of raw data, the importance of small number statistics, the large dy¬ 
namic range of gamma-ray signals in time (1 - 10^ sec) and angular extent (0.1 - 100 degrees), 
and the growing need to directly compare results from different observatories pose some spe¬ 
cial challenges for the analysis of HAWC data. To address these needs, we have designed and 
implemented a modular analysis framework based on the method of maximum likelihood. The 
framework facilitates the calculation of a binned Poisson Log-likelihood value for a given physics 
model (i.e., source model), data set, and detector response. The parameters of the physics model 
(sky position, spectrum, angular extent, etc.) can be optimized through a likelihood maximization 
routine to obtain a best match to the data. In a similar way, the parameters of the detector response 
(absolute pointing, angular resolution, etc.) can be optimized using a well-known source such as 
the Crab Nebula. The framework was designed concurrently with the Multi-Mission Maximum 
Likelihood (3ML) architecture, and allows for the definition of a general collection of sources 
with individually varying spectral and spatial morphologies. Compatibility with the 3ML archi¬ 
tecture allows to easily perform powerful joint fits with other observatories. In this contribution, 
we overview the design and capabilities of the HAWC analysis framework, stressing the overar¬ 
ching design points that have applicability to other astronomical and cosmic-ray observatories. 
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1. Introduction 

The High Altitude Water Cherenkov (HAWC) Gamma-ray Observatory is a newly commis¬ 
sioned instrument on volcano Sierra Negra in the state of Puebla, Mexico [1]. Here we describe 
the design and capabilities of a high-level (i.e., event-level) analysis framework. The framework is 
a set of software objects that provides particular functionality for statistical analysis of event-level 
data and that facilitates the development of analysis programs by end users. Event-level data are air 
shower events observed in the HAWC detector that have been reconstructed (i.e., a direction and 
other shower observables have been estimated) and have passed some low-level selection triggers. 
Our framework facilitates statistical analysis using the well-known likelihood formalism [2]. The 
framework supports a model-centric approach for a user to design an analysis program. That is, a 
user of the framework mainly concentrates on the physics model or detector model that she wants 
to test. The software interfaces to the physics model, the detector response, the binned data, and 
the calculation of likelihood values are standardized. The actual details of the physics models, the 
details of the detector response, and the binning of the data are defined by the user through these 
standardized interfaces. 

Some noteworthy design features of the framework are: 

• The physics model definition has no restriction with regard to energy spectrum, source spatial 
morphology, or number of sources. 

• The physics model can be arbitrarily parameterized and any number of parameters can be 
optimized with a maximum likelihood routine. 

• The detector response model can be parameterized enabling details of the detector response 
to be estimated directly from data using a well-known gamma-ray source (e.g., the Crab 
Nebula). 

• The data is binned on the sky using Healpix bins [3]. The data can be binned in an arbitrary 
number of other dimensions (e.g., event quality, energy, etc.) that are defined by fhe user. 
The binned dafa (fhe size of which is fypically 10^ bins, neglecting fhe lime dimension) is 
efficienlly slored on disk and handled in memory. 

• A routine fo calculate a profile maximum likelihood is supported. 

• The framework is fully compafible wifh fhe Mulfi-mission Maximum Likelihood (3ML) ar- 
chifeclure [4] which enables fhe incorporation of multiple insfrumenfs (e.g., HAWC and 
Fermi/LAT) in a join!-likelihood fif. 

In fhe following secfions we elucidafe fhe design fealures of fhe HAWC evenf-level analysis 
framework, also called Likelihood Filling Framework (LiFF). 

2. Likelihood Formalism for Statistical Significance 

The main purpose of fhe framework is lo facililale fhe calculation of fhe likelihood of a par¬ 
ticular physics model and defector response model given an observed set of data. Currently, the 
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framework is designed with the minimum time window as one transit (one day) ^ To substan¬ 
tially reduce the computational workload, we bin the data - we calculate a binned likelihood value. 
HAWC event-level data is binned in a multi-dimensional array of event attributes. For a typical 
HAWC analysis, the array dimensions include the event direction as a set of two sky coordinates 
(e.g., right ascension and declination) and event quality and/or an energy metric. 

The calculation of the logarithm of the likelihood is computationally convenient: 

AUBins 

lnCie-,Nobs) = £ lnif{{Nobs)i\e)), 

/=o 

where 6 is a set of model parameters, {Nobs)i is the number of events in the ith bin, is the 
collection of all {Nobs)i, and / is the probability of observing {Nobs)i given 6. To estimate the 
optimum set of parameters do, we find the 6 that maximizes the value of ln£, or equivalently we 
minimize the value of — ln£. Estimates of the parameter uncertainties and the correlations between 
parameters can be calculated with well known methods [5]. 

To compare hypotheses, the Likelihood Ratio test is used. A test statistic is defined as 

TS-2ln Hypothesis;Np^) 

£(Null Hypothesis; Noi,i) 

When the null-hypothesis is true and in the case of nested models, TS is approximately distributed 
as the distribution with the number of degrees of freedom equal to the difference in the number 
of free parameters between the hypotheses (Wilks’ theorem [6]). For example, the usual source 
discovery test has the null hypotheses as “there is no source” and the alternative hypothesis as 
“there is a source with a flux normalization X”. In this case, the difference between the number of 
free parameters is one (the flux normalization). When the null hypothesis is true, TS is distributed 
as x^{DOF = 1), and \/YS can be interpreted as the significance of fhe over-abundance of events 
in units of "Gaussian sigma". Using the TS formalism enables comparison of arbitrary hypotheses 
with the full rigor of modern frequentist statistics. Note that non-nested models cannot use the 
Wilks’ theorem approximation but there are other methods available [7]. 

When the expected number of signal events in a bin is small (e.g., either due to a weak source, 
or a short observation time, or both), Poisson probability functions must be used in the calculation 
of ln£ or the calculation of the TS will be biased and not easily interpreted. In general, when using 
Poisson probability functions, ln£ is nonlinear with regard to the model parameters 0. Therefore, 
the minimization of the log-likelihood function is performed with a iterative, gradient search algo¬ 
rithm. We use the Minuit [8] minimization package. If we assume a Gaussian probability function 
instead of Poisson, In £ is linear with regard to the flux normalization parameter and can be solved 
algebraically. This method is known as Gaussian weighting in the literature (e.g., [9]). Our frame¬ 
work incorporates a method that implements the Gaussian weighting algorithm. The results of 
the Gaussian weighting algorithm are typically used as a first guess for the minimization of the 
non-linear ln£ function. 

The model parameters 0 can be divided into physics parameters (the parameters of main inter¬ 
est) and nuisance parameters. An example of a nuisance parameter for HAWC is the background 

*We currently have other software routines for dealing with shorter time windows. We plan to expand the capabili¬ 
ties of this framework to shorter time windows in the future. 
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Figure 1: A diagram of the framework architecture. The code writting by the user is typically not complex. 
For example, software objects are instantiated and initialized, then a GetLikelihood method is called. 

normalization (the amount of hadronic events that pass event quality cuts) in a particular bin. A 
profile maximum likelihood takes into account the idea of physics parameters and nuisance pa¬ 
rameters. Two nested loops are implemented where the nuisance parameters are estimated in the 
inner loop and the physics parameters are estimated in the outer loop. In this way, the outer loop 
can incorporate likelihood values for a particular physics model from several different instruments 
with the nuisance parameters associated with each instrument effectively decoupled. This idea is 
part of the 3ML architecture, which is described in detail in [4]. The 3ML idea can be thought 
of as standardizing the statistical language that is used both within a collaboration and between 
collaborations - it has wide applicability. 

3. Software Architecture 

In Figure 1 we show a diagram of the framework architecture and how it is used in the analysis 
of HAWC data. The framework is composed of software objects grouped into five modules: 1.) 
Physics Model, 2.) Defector Response Model, 3.) Binned Dafa, 4.) Likelihood Calculafion, and 
5.) Minimizer. The dependencies of fhe framework are ROOT [10] and HEALPix. In fhe following 
sub-secfions we briefly overview fhe highlighfs of fhe framework componenfs. 

3.1 Binned Event-level Data 

Evenf-level dafa is binned on fhe sky using HEAEPix pixels, which are equal area pixels. A 
sef of HEAEPix pixels fhaf covers fhe whole sky is called a HEAEPix Map. Evenf-level dafa are 
also binned in ofher dimensions. Eor example, we currenfly fypically use 10 evenf-qualify bins. 
The evenfs in a particular evenf-qualify bin have a similar angular resolution and gamma-hadron 
resolution. Thai is, fhe deleclor response is a function of fhe evenf-qualify. Our currenl melric for 
evenf-qualify is fhe number of hif pholomulliplier lubes in fhe array, which is slrongly correlaled 
to fhe energy deposifed in fhe array. One HEAEPix map wilh evenl counls is stored for each 
evenf-qualify bin. 
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An arbitrary number of bin dimensions beyond the two sky coordinates can be defined and 
implemented by the user. Each additional bin, in a dimension other than the sky coordinates, is 
realized as an additional HEALPix Map. Eor example, the data can be binned in both event-quality 
and in a metric which estimates the energy of the primary particle. This energy metric is the energy 
deposited on the array adjusted for core location on the array and zenith angle. Binning in an energy 
metric is important for detailed spectral analysis. If we had 10 event quality bins and 10 energy 
bins, we would have a total of 100 HEAEPix Maps. In this way, a HEAEPix map is a "higher 
dimension bin". 

In a typical analysis using our framework, only the bins that have a signal expectation accord¬ 
ing to the Physics Model and Detector Response Model are summed over in the calculation of ln£. 
Eor a Physics Model that is a point-source, for example, only sky pixels in a region of typically 5° 
radius around the hypothesized source location have to be analyzed, since the point spread function 
of HAWC vanishes for larger angular separations. Thus, we typically only keep a small fraction of 
the total number of sky-pixels in memory at any one time. This significantly lowers our memory 
requirements (important if jobs are to be run in parallel) as well as our disk access time. To handle 
partial sky regions in memory, we developed the SkyMap class that inherits from the HEAEPix 
base class. The SkyMap class stores a limited number of sky pixels where the boundaries of the 
pixel-set can be arbitrary polygonal or circular shapes. 

As a fast and compressed file storage format for HEAEPix maps, we implemented a class 
called MapTree that depends on ROOT object classes. Each pixel value is stored in a TTree object 
with a single branch, where the entry index number is identical to the HEAEPix pixel index, and the 
TTree is stored as a ROOT TFile. We have found that with this on-disk format, writing and reading 
times for a full set of sky pixels is faster than or comparable to the PITS [11] on-disk format. More 
importantly, the writing and reading of partial sky regions are substantially faster. 

Our time bin, the length of time for which data is collected to generate a single set of HEAEPix 
maps, is an arbitrary time period. The maximum length of the time bin is typically determined by 
the duration of a stable detector configuration, relevant in particular during the construction phase. 
To facilitate time-dependent analyzes like in [12], the current minimum time interval is one sidereal 
day. 


3.2 Physics Model 

The physics model is implemented as a collection of one or more sources. We define two types 
of sources: 1.) A point source, and 2.) An extended source. The point source has a location and a 
spectrum. A general spectrum is defined by a table of differential flux values, binned in arbitrary 
energy intervals. Eor many analyzes, the spectrum can also be described by a simple function, for 
example a power law, in which case the spectrum is stored as a TFl object [10]. The extended 
source has a location, a spectrum, and a spatial morphology. The spatial morphology interface is 
still under development but will eventually allow arbitrary tabular or functional inputs describing 
flux intensity and spectra as a function of location. 

This implementation gives the user a tremendous amount of freedom in setting the physics 
model. Note that each source can have an independent spectrum and that point sources and ex¬ 
tended sources can be implemented in the same physics model. Some examples of physics models 
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that can be implemented and tested are a closely spaced double source, a soft-spectrum source 
embedded in a hard-spectrum extended source, and the diffuse emission from the galactic plane. 

3.3 Detector Response Model 

If the time bin is an integer number of sidereal days, then the detector response is a func¬ 
tion of the declination of the source but not the right ascension. We define the detector response 
as a function of the declination and the source spectrum. Given a source declination and arbi¬ 
trary source spectrum the code within the Detector Response Module calculates the expected point 
spread functions (how events are spread on the sky due to detection limitations) for each of the 
higher-dimension bins and the number of events in each of the higher-dimension bins. This infor¬ 
mation is then used by code within the Likelihood Calculation Module to determine the expected 
number of signal events in each bin for a particular physics model. 

The description of the detector response is realized as a collection of histograms {THl objects 
[10]). These histograms are generated from simulation using a particular spectrum. If the source 
spectrum called for in the physics model is different from the spectrum used to generate the orig¬ 
inal detector response, the detector response is modified by re-weighting the energy distribution 
histograms for each higher dimension bin. This is a substantial computational short cut and is ac¬ 
curate as long as the spectrum called for in the physics model is not too different from the specttum 
used to generate the original detector response. 

There is functionality such that the detector response can be parameterized and these param¬ 
eters can be free in the minimization of — ln£. This allows for the determination of the detector 
response directly from the data. For example, if we assume that the spatial extent of the Crab Neb¬ 
ula is much smaller than our point spread function, we can determine the point spread function that 
is most consistent with the collection of events with directions near the Crab [13]. 

3.4 Likelihood Calculation 

To calculate ln£ we need to know the expected number of signal events and the expected 
number of background events for each bin. The expected number of signal events in each relevant 
bin (bins that have zero expected source events are not included) is calculated using the physics 
model parameters set by the user and the relevant detector response information. The expected 
number of background events (i.e., hadron events that have passed our event quality selections) 
are normally not estimated from simulation but rather directly from the data. The background is 
either estimated in a prior step that occurs when the data is first binned [14], or the background is 
estimated as a nuisance parameter as part of the minimization of — In C. 

Then to calculate the log likelihood using Poisson probability: 

AllBins 

{Nobs^^^exp Nexp 0 ) • 

Note that calculating (ln(x!)) is numerically efficient (e.g., using Stirling’s approximation). 

A minimization of — ln£ can be performed with an arbitrary number of free parameters. Two 
nested minimization loops are supported. This enables the calculation of a profile likelihood where 
nuisance paramefers (e.g., fhe expected number of background evenfs in each bin) are esfimafed 
in fhe inner loop and fhe parameters associafed wifh fhe physics model are esfimafed in fhe outer 
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loop. The calculation of such a profile likelihood is key to incorporating multiple instruments (e.g., 
HAWC and Fermi/LAT) in a joint-Likelihood fit. 

4. Conclusions 

The HAWC Collaboration has designed and coded a software framework for the statistical 
analysis of event-level data. The likelihood formalism is used. With the addition of a small amount 
of code, a user can efficiently explore any number of complex hypotheses. The framework is 
model-centric in that the user mainly focuses on the design of the physical model that she wants 
to confront with data. The complexities of handling the binned data, the detector response, the 
calculation of In C, and the minimizations are handled by standardized software interfaces. 

The framework was designed along side and is fully compatible with the 3ML architecture. 
The 3ML architecture allows for the incorporation of multiple instruments in a joint-likelihood tit 
and is a powerful addition to the broader field of astronomy and cosmic-ray physics. 
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